ETC5521 Tutorial 9

Going beyond two variables, exploring high dimensions

Author

Prof. Di Cook

Published

16 September 2024

🎯 Objectives

These are exercises in plots to make to explore relationships between multiple variables. You will use interactive scatterplot matrices, interactive parallel coordinate plots and tours to explore the world beyond 2D.

🔧 Preparation

install.packages(c("tidyverse", "cassowaryr", "tourr", "GGally", "plotly", "colorspace", "mulgar"))
  • Open your RStudio Project for this unit, (the one you created in week 1, ETC5521). Create a .qmd document for this weeks activities.

📥 Exercises

Exercise 1: Melbourne housing

  1. Read in a copy of the Melbourne housing data from Nick Tierney’s github repo which is a collation from the version at kaggle. Its fairly large, so let’s start simply, and choose two suburbs to focus on. I recommend “South Yarra” and “Brighton”. (Note: there are a number of missing values. I recommend removing these before making plots.)
mel_houses <- read_csv("https://raw.githubusercontent.com/njtierney/melb-housing-data/master/data/housing.csv") %>%
  dplyr::filter(suburb %in% c("South Yarra", "Brighton")) %>%
  dplyr::filter(!is.na(bedroom2)) %>%
  dplyr::filter(!is.na(bathroom)) %>%
  dplyr::filter(!is.na(price))
  1. Make a scatterplot matrix of price, rooms, bedroom2, bathroom, suburb, type. The order of variables can affect the readability. I advise that the plot will be easier to read if you order them with the numerical variables first, and then the categorical variables. Think about what associations can be seen?
ggpairs(mel_houses, columns=c(4,2,10,11,1,3))

  • Except for price the continuous variables are all discrete. We can still examine the associations. It could be useful to use a jittered scatterplot, but that would require making a special plot function to use in the ggpairs function.
  • There is positive linear association between price, rooms, bedroom2, bathroom, which indicates the bigger the house the higher the price
  • From the boxplots: houses in Brighton tend to be higher priced and bigger than South Yarra, and houses tend to be worth more than apartments or units.
  • From the fluctuation diagram, Brighton tends to have more houses, and South Yarra has more apartments.
  • From the density plot, price has a skewed distribution.
  • There is one big outlier, one house sold for a much higher price. There are a few bivariate outliers, houses with a large number of bathrooms but relatively low price.
# To add jitter
ggpairs(mel_houses, columns=c(4,2,10,11,1,3),
        lower=list(continuous=wrap("points",
                  position=position_jitter(height=0.3, width=0.3))))
  1. Subset the data to South Yarra only. Make an interactive scatterplot matrix of rooms, bedroom2, bathroom and price, coloured by type of property. There is a really high price property. Select this case, and determine what’s special about it – why did it sell for so much? Select the outlier in bedrooms and bathrooms, and examine the other characteristics of this property.
south_yarra <- mel_houses %>% 
  dplyr::filter(suburb=="South Yarra") %>%
  dplyr::select(rooms, bedroom2, bathroom, price, type)
highlight_key(south_yarra) %>%
  ggpairs(aes(color = type), columns = c(4,2,1,3),
                  upper=list(continuous="points")) %>%
  ggplotly() %>%
  highlight("plotly_selected")

This property that has a high price has relatively modest characteristics! It has 4 bedrooms and 2 bathrooms.

# To add jittering
highlight_key(south_yarra) %>%
  ggpairs(aes(color = type), 
          columns = c(4,2,1,3),
          lower=list(continuous=wrap("points",
                  position=position_jitter(height=0.3, width=0.3))),
          upper=list(continuous=wrap("points",
                  position=position_jitter(height=0.3, width=0.3)))) %>%
  ggplotly() %>%
  highlight("plotly_selected") 

Exercise 2: Parkinsons

This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson’s disease (PD). Each column in the table is a particular voice measure, and each row corresponds one of 195 voice recording from these individuals (“name” column). The main aim of the data is to discriminate healthy people from those with PD, according to “status” column which is set to 0 for healthy and 1 for PD.

The data is available at The UCI Machine Learning Repository in ASCII CSV format. The rows of the CSV file contain an instance corresponding to one voice recording. There are around six recordings per patient, the name of the patient is identified in the first column. There are 24 variables in the file, including the persons name in column 1.

The data are originally analysed in: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), ‘Suitability of dysphonia measurements for telemonitoring of Parkinson’s disease’, IEEE Transactions on Biomedical Engineering (to appear).

library(cassowaryr)
# Load the data
data(pk)
  1. How many pairwise plots would you need to look at, to look at all of them?

There are 23 numeric variables in the data set, which would require 253 pairwise plots to be made.

  1. Compute several of the scagnostics (monotonic, outlying, clumpy2) for the first five variables of variables, except for name. (Note: We are using just five for computing speed, but the scagnostics could be calculated on all variables.)
# Compute the scagnostics on the relevant variables
s <- calc_scags_wide(pk[,2:5],
                scags=c("outlying","monotonic",
                        "clumpy2"))
s
# A tibble: 6 × 5
  Var1           Var2         outlying clumpy2 monotonic
  <fct>          <fct>           <dbl>   <dbl>     <dbl>
1 MDVP:Fhi(Hz)   MDVP:Fo(Hz)     0.412   0.756    0.796 
2 MDVP:Flo(Hz)   MDVP:Fo(Hz)     0.172   0.541    0.324 
3 MDVP:Flo(Hz)   MDVP:Fhi(Hz)    0.336   0.671    0.0956
4 MDVP:Jitter(%) MDVP:Fo(Hz)     0.289   0        0.270 
5 MDVP:Jitter(%) MDVP:Fhi(Hz)    0.541   0.764    0.0978
6 MDVP:Jitter(%) MDVP:Flo(Hz)    0.430   0        0.407 
  1. Sort the scagnostics, separately by the values on (i) monotonic (ii) outlying (iii) clumpy2, and plot the pair of variables with the highest values on each.
# Check the results for monotonic
s %>% 
  select(Var1, Var2, monotonic) %>% 
  arrange(desc(monotonic)) 
# A tibble: 6 × 3
  Var1           Var2         monotonic
  <fct>          <fct>            <dbl>
1 MDVP:Fhi(Hz)   MDVP:Fo(Hz)     0.796 
2 MDVP:Jitter(%) MDVP:Flo(Hz)    0.407 
3 MDVP:Flo(Hz)   MDVP:Fo(Hz)     0.324 
4 MDVP:Jitter(%) MDVP:Fo(Hz)     0.270 
5 MDVP:Jitter(%) MDVP:Fhi(Hz)    0.0978
6 MDVP:Flo(Hz)   MDVP:Fhi(Hz)    0.0956
ggplot(data=pk, 
       aes(x=`MDVP:Fhi(Hz)`, y=`MDVP:Fo(Hz)`)) + 
  geom_point() + theme(aspect.ratio = 1)

The top pair of variables on monotonic has a strong positive association with the majority of points, and a few outliers. The top pair of variables on outlying, is also the top pair on clumpy2, and has outliers with some clumpiness in the mass of points with low values.

  1. Make an interactive scatterplot matrix. Browse over it to choose other interesting pairs of variables and make the plots.
# Create an interactive splom
s <- s %>%
  mutate(vars = paste(Var1, Var2))
highlight_key(s) %>%
  GGally::ggpairs(columns = 3:5, mapping = aes(label = vars)) %>%
  ggplotly(tooltip = "all", 
           width=500, 
           height=500) %>%
  highlight("plotly_selected") 
  1. The scagnostics help us to find interesting associations between pairs of variables. However, the problem here is to detect differences between Parkinsons patients and normal patients. How would you go about that? Think about some ideas long the line of scagnostics but look for differences between the two groups.
# One way to examine difference between Parkinsons and healthy
pk_med <- pk %>% 
  select(-name) %>% 
  group_by(status) %>%
  summarise_all(list(median, sd)) %>%
  pivot_longer(
    cols=`MDVP:Fo(Hz)_fn1`:`PPE_fn2`,
    names_to="var", 
    values_to="value") %>%
  separate(var, c("var","stat"), "_") %>%
  mutate(stat = fct_recode(stat,
                           "m"="fn1",
                           "s"="fn2")) %>%
  pivot_wider(names_from=stat,
              values_from=value) %>%
  group_by(var) %>%
  summarise(
    d = (m[status==0]-m[status==1])/sqrt(s[status==0]^2+s[status==1]^2))
pk_med %>% arrange(desc(d)) %>% head()
# A tibble: 6 × 2
  var               d
  <chr>         <dbl>
1 MDVP:Fo(Hz)   0.870
2 HNR           0.647
3 MDVP:Fhi(Hz)  0.518
4 MDVP:Flo(Hz)  0.211
5 NHR          -0.243
6 MDVP:RAP     -0.356
ggplot(pk, aes(x=factor(status), y=`MDVP:Fo(Hz)`)) + 
  geom_boxplot()

Generally we are looking for variables where the differences between the Parkinsons and normal patients are big. You need to measure big, relative to the variance of each group. Doing a two sample t-test for each variable is one approach. Here, I’ve computed the median for each group of patients and compared the difference in medians relative to the pooled standard deviation in each group.

Exercise 3: Challenges

For each of the data sets, c1, …, c7 from the mulgar package, use the grand tour to view and try to identify structure (outliers, clusters, non-linear relationships).

library(mulgar)
animate_xy(c1)
# four small clusters, two big clusters
# linear dependence
animate_xy(c2) 
# Six spherical clusters
animate_xy(c3)
# tetrahedron with lots of smaller triangles,
# barriers, linear dependence
animate_xy(c4) 
# Four linear connected pieces
animate_xy(c5)
# Spiral in lower dimensional space
# Non-linear and linear dependence
animate_xy(c6)
# Two curved clusters
animate_xy(c7)
# spherical cluster, curve cluster and a lot of noise points

👌 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.